linalg/x86_64: AVX-512_FP16 native f16 hardswish kernel #10

Open
czoli1976 wants to merge 75 commits into base/sonos-main from feat/avx512fp16-native-f16

@czoli1976

Summary

Adds a native AVX-512_FP16 path for the f16 element-wise activations on Sapphire Rapids / Granite Rapids / later Intel parts. Computes f16 directly in zmm registers (32 lanes per zmm) using vmulph / vfmadd*ph / vminph / vmaxph / vaddph — no vcvtph2ps / vcvtps2ph round-trip, no f32 scratch buffer.

Plugged via a new plug_avx512fp16(ops) step that runs after plug_avx512f on hosts where is_x86_feature_detected!("avx512fp16") is true. Pre-FP16 AVX-512 hosts (Skylake-X, Cascade Lake, Ice Lake server prior to the fp16 extension) keep using the existing f32-roundtrip kernels from act_f16.rs unchanged.
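
The plug ordering can be pictured with a small stand-in model. This is only a sketch under stated assumptions: `Impl`, the two-field `Ops`, and the boolean feature flags are hypothetical simplifications (the real `Ops` table holds kernel constructors, and the real code queries the CPU at runtime). Nesting the fp16 step under the avx512f check is also an assumption. The sketch only illustrates that the later step overrides a single slot and leaves the rest alone.

```rust
/// Hypothetical tag standing in for a real kernel function pointer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Impl {
    Generic,
    Avx512F32Roundtrip,
    Avx512Fp16Native,
}

/// Hypothetical, heavily reduced stand-in for the linalg `Ops` table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Ops {
    pub hardswish_f16: Impl,
    pub leaky_relu_f16: Impl,
}

/// AVX-512F step: both f16 activations go through the f32 round-trip kernels.
fn plug_avx512f(ops: &mut Ops) {
    ops.hardswish_f16 = Impl::Avx512F32Roundtrip;
    ops.leaky_relu_f16 = Impl::Avx512F32Roundtrip;
}

/// FP16 step: overrides only the hardswish slot; leaky_relu is left untouched.
fn plug_avx512fp16(ops: &mut Ops) {
    ops.hardswish_f16 = Impl::Avx512Fp16Native;
}

/// The booleans stand in for runtime CPU feature detection (assumption).
pub fn plug(has_avx512f: bool, has_avx512fp16: bool) -> Ops {
    let mut ops = Ops { hardswish_f16: Impl::Generic, leaky_relu_f16: Impl::Generic };
    if has_avx512f {
        plug_avx512f(&mut ops);
        if has_avx512fp16 {
            plug_avx512fp16(&mut ops);
        }
    }
    ops
}
```

In this model, a host with both features ends up with the native hardswish and the round-trip leaky_relu, which matches the table in this PR.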

Stacks on the existing AVX-512 set (#3, #4, #5, #6, #7, #8, #9 in this fork; sonos#2303-sonos#2307 on sonos/tract for the upstreamed ones, plus the still-on-fork f16 activations and RmsNorm PRs).

What's in this PR

| Kernel | Plugged? | Why |
|---|---|---|
| hardswish_f16_128n | ✅ Yes | 3.62× faster than the f32-roundtrip version on Sapphire Rapids |
| leaky_relu_f16_128n | ❌ No, kept as code-only | 0.62× the throughput of the f32-roundtrip version on Sapphire Rapids, i.e. a regression. Native code retained for a future revisit on other fp16 uarchs (Granite Rapids may flip the result). |

Other f16 activations (sigmoid, tanh, silu, gelu) are not ported here — they require polynomial approximations whose precision in native f16 (11-bit mantissa) needs separate validation. Out of scope for this PR.

Kernel-level bench (Sapphire Rapids, n=1024, single thread, Criterion)

hardswish_f16
  generic              52.3 Melem/s
  avx512_f32roundtrip   8.71 Gelem/s    ← current czoli1976#8 baseline
  avx512fp16_native    31.6  Gelem/s    ← this PR (3.62× over f32-roundtrip)

leaky_relu_f16
  generic             778   Melem/s
  avx512_f32roundtrip   9.44 Gelem/s    ← current czoli1976#8 baseline
  avx512fp16_native     5.85 Gelem/s    ← this PR's kernel, NOT plugged (regression)

Bench file: linalg/benches/activations_avx512_fp16.rs (new).

Why does leaky_relu regress?

Best read: lightweight 2-op-per-element compute paths (vmulph + vmaxph) don't saturate Sapphire Rapids' FP16 execution port the same way the equivalent f32 ops saturate the FP32 ports. The native kernel's instruction-per-element count is genuinely lower than the f32-roundtrip path (0.125 vs 0.375 ops/element by static count) but the f32 path achieves higher actual throughput per cycle. Could be: f16 mul/max throughput is 1 CPI vs f32's 0.5 CPI on this uarch; or the round-trip is benefiting from out-of-order overlap of the conversion with subsequent compute. I didn't investigate microarchitecturally past confirming the kernel is correct.

For hardswish, the per-iteration compute is ~6× heavier (4 add + 4 min + 4 max + 8 mul per chunk), enough to saturate the FP16 ports, and the 2× lane-count advantage of native fp16 dominates.
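
For readers comparing the two op mixes, here is a scalar sketch of both activations. It uses the standard definitions and plain `f32` (stable Rust has no `f16` type), so it is not the kernel in this PR, only an illustration of which arithmetic steps each element needs. The `max(x, alpha * x)` form of leaky_relu assumes `0 <= alpha <= 1`.

```rust
/// hardswish(x) = x * min(max(x + 3, 0), 6) / 6
/// Five arithmetic steps per element: add, max, min, mul, mul.
pub fn hardswish(x: f32) -> f32 {
    let shifted = x + 3.0; // add
    let clamped = shifted.max(0.0).min(6.0); // max, then min
    x * clamped * (1.0 / 6.0) // two multiplies
}

/// leaky_relu(x) = max(x, alpha * x), valid for 0 <= alpha <= 1 (assumption).
/// Two arithmetic steps per element: mul, max.
pub fn leaky_relu(x: f32, alpha: f32) -> f32 {
    x.max(alpha * x)
}
```

Counting per 4-zmm chunk, the hardswish steps give the 4 add + 4 min + 4 max + 8 mul quoted above, against 4 mul + 4 max for leaky_relu.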

Test plan

  • cargo test --release -p tract-linalg --lib act_f16_fp16 — 4 passed, 0 failed (2 ops × 2 cases: trivial + proptest against scalar f16 reference)
  • cargo test --release -p tract-linalg --lib — 2845 passed, 0 failed
  • cargo test --release -p tract-core --lib — 245 passed, 0 failed
  • cargo bench --bench activations_avx512_fp16 — numbers above
  • cargo fmt --all -- --check clean
  • cargo clippy --release -p tract-linalg clean on the new file (no act_f16_fp16 warnings)
  • Cross-arch: cargo check --target aarch64-unknown-linux-gnu and --target wasm32-unknown-unknown both clean (plug_avx512fp16 is x86_64-only and runtime-feature-gated)

Dependencies

Built on top of the merged-all-7 base branch (which has sonos#2303-sonos#2307 plus the not-yet-upstream f16 activations and RmsNorm work). The native fp16 kernel only fires on top of the f32-roundtrip path from #8: plug_avx512fp16 overrides one slot that plug_avx512f set just above. If the upstream f16 activations PR (#8) lands first, this PR rebases trivially.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Generated by Claude Code

kali and others added 13 commits May 28, 2026 11:00
`TypedModelPatch::shunt_outside` leaves the shunted node in the graph,
but the NNEF `patch` transform also implicitly removed model inputs
whose name appeared on the LHS.  That hidden side-effect made
`patch` do two things at once: substitute a wire, and trim the
interface.  Drop the trimming.

Add a sibling `select_inputs(inputs: [...])` transform shaped like
`select_outputs`.  The pulse pipeline now reads:

  -t 'patch(body: "length = tract_core_shape_of(input_signal)[1];")' \
  -t 'select_inputs(inputs: ["input_signal"])'              \
  -t 'select_outputs(outputs: ["processed_signal"])'        \
  -t 'pulse(symbol: ..., pulse: ...)'

Discarded Sources stay in the graph until declutter prunes them.

Wire-up: `Graph::select_inputs_by_name` (mirror of
`select_outputs_by_name`) + `with_inputs_by_name` + transform
registration.  Updated harness/nemotron + nemo-nemotron-asr +
nemo-nemotron-streaming-asr to add the explicit `select_inputs` step.

The 'without-default-features' job in full.yml (cargo check -p tract-cli
--no-default-features) regressed after the cuda-12XXX split: cudarc and
tract-cuda were still pulled in unconditionally on linux/windows targets,
so stripping the cuda-13000 default left cudarc with no API-version
feature and its build script panicked.

Make both deps optional in tract-cli and tract-libcli, and have each
cuda-XXXXX feature pull them in (dep:cudarc + dep:tract-cuda +
tract-cuda/cuda-XXXXX + tract-libcli/cuda).  Adds a marker 'cuda'
feature so cudarc-touching code in bench.rs / dump.rs / libcli/lib.rs
can gate cleanly.

test-cuda explicitly opts into cuda-13000 (workspace dep has
default-features=false now), so 'cargo test -p tract-cuda -p test-cuda'
keeps building.

Unify the four overlapping names for 'bind a symbol to a value across
the model graph' under one verb:

  - core: `TypedModel::substitute_symbols` → `set_symbols`
  - core: `TypedOp::substitute_symbols` trait method → `set_symbols`
  - transform name: `concretize_symbols(values: …)` → `set_symbols(values: …)`
  - Rust API: `ConcretizeSymbols` → `SetSymbols`
  - Python API: `tract.ConcretizeSymbols` → `tract.SetSymbols`

The CLI `--set B=1` flag was already aligned and is unchanged.  No
deprecation aliases — hard rename across cli, harness scripts, examples
and Python bindings.

The Rust API builder gains a `SetSymbols::expr(name, str)` companion
to `value(name, i64)` so callers can pass TDim expressions (e.g.
`'2*S'`) the way the CLI `--set` and the transform already do.

`TDim::substitute` / `TDim::substitute_all` are unchanged: they
operate on a single TDim expression, not on the model, and "substitute"
is the accurate verb for that level.

The top-level `--set` flag was already TDim-aware via `parse_set_subs`
in params.rs; the `run` subcommand had a parallel `--set` flag that
only accepted plain i64.  Parse RHS as a TDim against the model's
symbol scope and reduce to i64 with the symbols set so far on the
command line, so `run --set FOO=2 --set T=2*FOO` resolves cleanly.

Order is CLI-significant: a symbol referenced on the RHS must be set
to its left.  Errors out with the unresolved name in the message.
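
The left-to-right rule can be illustrated with a toy resolver. This is a sketch, not the code in params.rs: `resolve_sets` is a hypothetical name, and the right-hand side is limited to products of integer literals and already-set symbols, which is far narrower than a real TDim expression.

```rust
use std::collections::HashMap;

/// Resolve `NAME=EXPR` arguments in command-line order.
/// EXPR is a `*`-separated product of integers and previously set symbols
/// (a deliberately tiny stand-in for TDim parsing).
pub fn resolve_sets(args: &[&str]) -> Result<HashMap<String, i64>, String> {
    let mut values: HashMap<String, i64> = HashMap::new();
    for arg in args {
        let (name, rhs) = arg
            .split_once('=')
            .ok_or_else(|| format!("expected NAME=EXPR, got {arg}"))?;
        let mut product = 1i64;
        for factor in rhs.split('*') {
            let factor = factor.trim();
            let value = match factor.parse::<i64>() {
                Ok(n) => n,
                // Only symbols set further left are visible here.
                Err(_) => *values
                    .get(factor)
                    .ok_or_else(|| format!("unresolved symbol {factor} in {arg}"))?,
            };
            product *= value;
        }
        values.insert(name.trim().to_string(), product);
    }
    Ok(values)
}
```

With this model, `["FOO=2", "T=2*FOO"]` resolves `T` to 4, while the reversed order fails with an error naming `FOO`.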

The optimized Scan body runs the same plan with the same shapes every
timestep, so resolve its symbols once, reset between iters without
discarding them (reset_turn_keep_symbols), and reuse one drained input
buffer -- instead of a full model_state.run() cycle (set_inputs ->
resolve_symbols -> exec -> outputs -> reset_turn) per timestep.

Bit-identical to the old path across GRU/LSTM/RNN + df_dec. No measurable
wall-clock impact on fixed main (within +/-1% noise on gru/lstm/rnn 128/50
& 256/100 and df_dec, single-thread); kept as a cleanup of the per-iter
re-entry path, not as a perf change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Prefill-only GroupQueryAttention lowered onto tract Sdpa: reshapes Q/K/V to
4D, applies an explicit lower-triangular causal mask, and returns
present_key/present_value (the reshaped K/V). Sdpa handles the grouped-query
head sharing (kv_num_heads < num_heads). Decode-step KV cache, internal
rotary (do_rotary), local-window attention and softcap are rejected with
clear errors.

Validated against onnxruntime across head_size 8/16/64, several
num_heads/kv_num_heads ratios (incl. multi-query kv=1) and batch>1: attention
output matches to <=3.6e-7 and present_key/present_value are bit-exact.

ORT's GroupQueryAttention prefill is standard causal grouped-query attention;
the seqlens_k input is the 0-indexed position of the last token
(total_sequence_length - 1), not the token count.
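
As a small illustration of the two conventions mentioned here, the following sketch builds a lower-triangular causal mask and derives the `seqlens_k` value from a token count. The function names are hypothetical and the boolean-matrix mask is only one possible representation; the actual lowering works on tensors.

```rust
/// Lower-triangular causal mask: query position q may attend to key
/// position k only when k <= q.
pub fn causal_mask(seq_len: usize) -> Vec<Vec<bool>> {
    (0..seq_len)
        .map(|q| (0..seq_len).map(|k| k <= q).collect())
        .collect()
}

/// seqlens_k as described above: the 0-indexed position of the last token,
/// i.e. total_sequence_length - 1. Returns None for an empty sequence.
pub fn seqlens_k(total_sequence_length: usize) -> Option<usize> {
    total_sequence_length.checked_sub(1)
}
```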

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pin the resolved dependency graph so debug builds, release artifacts,
SBOMs and security audits all see the same versions.  CI runs against
this lockfile; `cargo update` is the explicit knob for bumping deps.

For each tract-cli release artifact (per target triple), generate both
CycloneDX and SPDX SBOMs from the workspace (Cargo.lock-driven) via
anchore/sbom-action and upload them alongside the .tgz.  Also pass
--locked to the release build so the SBOM matches the resolved deps
exactly.

The sbom-action ref is currently the v0.18.0 tag — dependabot
github-actions runs weekly and will SHA-pin on its next pass.

Each tract-cli release tarball now gets two GitHub attestations
(CycloneDX and SPDX SBOM, via actions/attest-sbom).  Anyone can
verify after download with:

  gh attestation verify tract-<triple>-<version>.tgz --owner sonos

Requires `id-token: write` + `attestations: write` on the job.
sbom-action's `upload-release-assets: false` keeps the SBOM files
out of its own upload path so the explicit softprops step is the
single source of release artifacts.

- cargo auditable wraps the release build so the resolved dependency
  graph lands inside the binary itself.  Consumers can recover the
  SBOM with `cargo audit bin tract` without needing the published
  .cdx.json / .spdx.json files.
- actions/attest-build-provenance@v2 signs the .tgz with provenance
  metadata (workflow ref, commit SHA, runner).  Combined with the
  existing SBOM attestations this lands at SLSA Build Level 3.

Pinned commits (latest stable as of writing):
- anchore/sbom-action @ e22c389 (v0.24.0)
- actions/attest-sbom @ c604332 (v4.1.0)
- actions/attest-build-provenance @ a2bbfa2 (v4.1.0)

Matches the existing SHA + comment convention used for
actions/checkout and softprops/action-gh-release; dependabot's
github-actions group will keep them current.

Two-part change so consumers can audit the deps that landed in the
tract Python wheel without needing to re-clone the Rust workspace:

1. `api/py/pyproject.toml` (Linux + macOS cibuildwheel before-build):
   install cargo-auditable and write a one-line bash shim that
   prefixes `auditable` to every cargo invocation.  setuptools_rust
   honours $CARGO (build.py:97), so pointing CARGO at the shim makes
   the Rust .so inside the wheel carry its dep graph in the
   `.dep-v0` ELF/Mach-O section.  Windows wheels stay as-is for now
   (TODO comment).

2. `.github/workflows/wheels.yml` + `.github/scripts/inject_wheel_sboms.py`:
   after cibuildwheel emits each .whl, install syft (via
   anchore/sbom-action/download-syft, SHA-pinned), unpack the wheel,
   scan its contents (syft's rust-audit-binary cataloger reads the
   embedded cargo-auditable section), drop sbom.cdx.json +
   sbom.spdx.json into `<dist-info>/sboms/` per PEP 770, and
   re-pack via `wheel pack` (which regenerates RECORD with hashes).

Smoke-tested locally on a sample wheel: SBOMs end up at the right
path and RECORD has correct sha256 entries.

@czoli1976 force-pushed the feat/avx512fp16-native-f16 branch from 15d837e to c1bd5bc on May 28, 2026 16:14

atty (0.2.x) is unmaintained and triggers RUSTSEC-2021-0145 on SBOM
audits.  It's only used in two places — both `is stderr a TTY`
checks in `tract hwbench` — and std::io::IsTerminal (stable since
1.70, well below tract's MSRV) is a drop-in.
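
For reference, the standard-library replacement looks roughly like this. `stderr_is_tty` is a hypothetical helper name, not the function used in `tract hwbench`; the `IsTerminal` trait and `is_terminal()` method are the real std API.

```rust
use std::io::IsTerminal;

/// True when stderr is attached to a terminal (std-only, no `atty` crate).
pub fn stderr_is_tty() -> bool {
    std::io::stderr().is_terminal()
}
```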

`cargo tree -i atty` after the change reports the crate is no longer
in the workspace dep graph.

@czoli1976 force-pushed the feat/avx512fp16-native-f16 branch from c1bd5bc to d40fd79 on May 29, 2026 08:13

kali and others added 14 commits May 29, 2026 10:25
runtime_for_name("gpu")        → first GPU backend whose `check()`
                                 passes (metal, then cuda); error if
                                 none are available.
runtime_for_name("gpu-or-cpu") → same lookup, but falls through to
                                 the `default` CPU runtime instead
                                 of erroring.

No new mechanism — both names walk the existing inventory and use each
backend's existing `check()` to decide availability.  Backend-specific
names (`cuda`, `metal`) still work as before.
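
The lookup rules can be sketched with a toy inventory. Everything here is hypothetical except the names being resolved: `Backend` and its `available` flag stand in for a registered runtime and the result of its `check()`, and the inventory is a plain slice rather than tract's real registry.

```rust
/// Hypothetical stand-in for a registered runtime.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Backend {
    pub name: &'static str,
    pub is_gpu: bool,
    /// Stands in for the result of the backend's `check()`.
    pub available: bool,
}

fn first_gpu(inventory: &[Backend]) -> Option<&Backend> {
    inventory.iter().find(|b| b.is_gpu && b.available)
}

fn by_name<'a>(inventory: &'a [Backend], name: &str) -> Option<&'a Backend> {
    inventory.iter().find(|b| b.name == name && b.available)
}

/// "gpu" errors when no GPU backend passes its check; "gpu-or-cpu" falls
/// through to the `default` CPU runtime; any other name is looked up as-is.
pub fn runtime_for_name<'a>(inventory: &'a [Backend], name: &str) -> Result<&'a Backend, String> {
    match name {
        "gpu" => first_gpu(inventory).ok_or_else(|| "no GPU backend available".to_string()),
        "gpu-or-cpu" => first_gpu(inventory)
            .or_else(|| by_name(inventory, "default"))
            .ok_or_else(|| "no runtime available".to_string()),
        other => by_name(inventory, other)
            .ok_or_else(|| format!("unknown or unavailable runtime {other}")),
    }
}
```

Listing `metal` before `cuda` in the slice reproduces the "metal, then cuda" preference described above.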

…or_name

The CPU runtime now reports its own name as `cpu` (which is what it
is), so `list-runtimes` shows `cpu`, `cuda`, `metal` … instead of
the misleading `default`.

Back-compat for callers passing `default` is handled by a one-line
alias in `runtime_for_name` rather than by registering two runtimes
or by polluting the trait — the alias only affects name lookup, not
the inventory.

The `tensorflow` 0.21.0 crate (Rust binding for libtensorflow) was
only pulled in behind the dead `conform` cargo feature — which gated
`tract compare --tf` (compare tract output against running on
libtensorflow on the same model).  The feature isn't enabled in any
GitHub workflow; only a stranded `.travis/tf.sh` ever ran it.

The upstream `tensorflow` crate hasn't shipped since 2023-08-15 and
pins to rust-protobuf 2.27.x, which trips RUSTSEC-2024-0437.  Drop
the feature and all its plumbing.

Tract's own `.pb` parsing (used by `-t transformers_detect_all` and
the `tf` cargo feature in tract-cli) goes through prost and is
unaffected — the `tract-tensorflow` crate stays, just without the
libtensorflow runtime.  Cargo.lock shrinks by ~350 lines as a
side-effect.

The LayerNorm op's `wire` expansion casts `normalized` back to
fact.datum_type *before* applying scale/bias, then multiplies that
result with `cast_scale` (which is still in self.datum_type, F32).

With F16 inputs this becomes F16 × F32, whose output is downgraded to
F32 by `mul()`. The inference rule then asserts
`outputs[0].datum_type == inputs[0].datum_type` (F16) against the
actual F32 output, failing `into_typed()` with:

    Output mismatch after rewiring expansion for output #0:
    expected 1,256,384,F16 got 1,256,384,F32

Fix: defer the cast back to fact.datum_type until after all scale/bias
operations. Now the expansion stays entirely in self.datum_type (F32)
through normalized × scale + bias, and casts only the final result.

Behavior is unchanged for F32 inputs (the final cast is a no-op when
fact.datum_type == self.datum_type).

Reproduced with sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
exported via `optimum.exporters.onnx.main_export(..., dtype="fp16")`
and loaded with `into_optimized().into_runnable()`.

The single-thread MMM tile walk used a naive nested loop, re-streaming the
full inner operand (all of A in col-outer / B in row-outer) per panel at
large k, which is memory/L1-bound. The multithread path already 2D-blocks the
panel grid (chunk_grid); this brings the same blocking to the single-thread
path, with the block edge cache-derived (detected L2/3, conservative 256 KiB
fallback) so it stays L2-resident across hardware and never over-blocks a
cache it cannot see.

Bit-identical: it only reorders independent tiles (each computes its full-k
reduction into a disjoint C region). The block-edge floor of 1 degrades
exactly to the naive loop; the cap of 16 matches the multithread chunk_grid
blocking already shipped on all platforms. Frame-level, so all kernels
benefit. +20-45% at large k on Apple Silicon (single-thread); small / GEMV /
multithreaded shapes are unchanged.

Adds 5 large-shape (>16-panel) frame tests exercising the blocked path against
the naive reference (the existing frame proptests only reach 3 panels).
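
The reordering argument can be checked on a toy model of the panel grid. This sketch only enumerates tile coordinates; `naive_order` and `blocked_order` are hypothetical names, the loop nesting is an assumption, and the cache-derived choice of block edge is not modelled.

```rust
/// Row-major walk over an m_panels x n_panels tile grid.
pub fn naive_order(m_panels: usize, n_panels: usize) -> Vec<(usize, usize)> {
    let mut order = Vec::with_capacity(m_panels * n_panels);
    for m in 0..m_panels {
        for n in 0..n_panels {
            order.push((m, n));
        }
    }
    order
}

/// Same tiles, visited block by block (square blocks of edge `block`).
/// A block edge of 1 reproduces the naive walk exactly.
pub fn blocked_order(m_panels: usize, n_panels: usize, block: usize) -> Vec<(usize, usize)> {
    let block = block.max(1); // floor of 1, as in the description above
    let mut order = Vec::with_capacity(m_panels * n_panels);
    for m0 in (0..m_panels).step_by(block) {
        for n0 in (0..n_panels).step_by(block) {
            for m in m0..(m0 + block).min(m_panels) {
                for n in n0..(n0 + block).min(n_panels) {
                    order.push((m, n));
                }
            }
        }
    }
    order
}
```

Because each tile writes a disjoint region of C, visiting the same set of tiles in a different order cannot change the result; the sketch makes the "same set, different order" part concrete.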

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The cfg(linux) sysfs read in `detect_l2_bytes` was not rustfmt-conformant
(it wasn't run through rustfmt on the macOS dev machine), so `cargo fmt
--check` failed in CI. Pure formatting; no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Passing 'v0.24.0' to post-release.sh writes 'version = "v0.24.0"' into
every Cargo.toml — invalid semver, breaks the workspace, easy to do by
muscle memory because the git tag does carry the 'v' prefix. Bail out
early in both release.sh and post-release.sh when the argument doesn't
match an unprefixed semver.

actions/checkout runs with persist-credentials: false, so the bare
'git push origin gh-pages' had no auth and failed with 'could not
read Username'. Use the workflow's GITHUB_TOKEN in the remote URL
instead — keeps zizmor happy while letting the deploy step push.

kali and others added 29 commits June 3, 2026 09:10
…os#2332)

The transform conflated two concerns: flipping external_state (a fixup for
NNEF artifacts predating the flag) and substituting the scan-axis symbol with
1 model-wide. The latter is the caller's per-call seq=1 contract, needed only
for declutter_single_loop's separate iters==1 gate, not implied by external
state. Keep the transform flag-only; harnesses now drive inlining with an
explicit -t set_symbols (T / TARGETS__TIME = 1) alongside the flag.

…omment (sonos#2334)

Supersedes dependabot sonos#2333. Pins the v6.1.2 commit (acca2b1b) and replaces
the floating '# v6' comment with '# v6.1.2'. zizmor's unpinned-uses flags a
hash pin whose comment tag no longer resolves to the pinned SHA; a major-only
tag drifts on every patch release, so use the exact version tag (immutable,
matches the SHA).

* build: sync Cargo.lock workspace versions to 0.23.1-pre

post-release 0.23.1-pre (688b476) bumped the crate manifests but left
Cargo.lock at 0.23.0, so every cargo build rewrote these 24 workspace-member
version entries and showed a spurious modified Cargo.lock.

* release: sync Cargo.lock in post-release version bump

post-release.sh edited manifests via tomato (which never invokes cargo) and
committed without regenerating Cargo.lock, shipping a lock that mismatched the
bumped versions. Add 'cargo update --workspace' before the commit. release.sh
avoids this incidentally because cargo publish re-resolves the lock first.

com.microsoft.RotaryEmbedding is identical math to the standardized
ai.onnx op but orders its inputs (input, position_ids, cos, sin). tract
resolves ops by name regardless of domain, so make the single handler
domain-aware and remap inputs accordingly. Rejects the contrib-only
scale != 1.0 and is_packed_batching attributes with clear errors.

Verified bit-exact against onnxruntime (3D, 4D, interleaved); ai.onnx
RotaryEmbedding conformance unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Curated behavioral subset of AGENTS.md (fmt, commit/comment style, model-edit
tooling, public API, test placement, PR etiquette) inlined so it lands in the
auto-loaded context; AGENTS.md stays the full reference.

The single "default to none" rule was suppressing doc comments along with
inline narration. Scope the austerity to inline comments and add a doc-comment
section that encourages concise item docs (contract, inputs, rule interactions)
while keeping the same no-benchmarks/no-history rule.

…c + seq-len lowering heuristic

P·V is computed as one contiguous tile GEMM (`s.dot(&vblock)`) instead of
`head_dim` strided per-column dots; the strided column access defeated
vectorization. Bit-exact (max_abs = 0 vs a naive softmax(QKᵀ·scale)·V ref).

The independent (batch, q-head) tasks now run across cores on rayon's global
pool — heads share only read-only Q/K/V and write disjoint output slices.
Disable with TRACT_FLASH_SDPA_ST=1; single-threaded on wasm. The op scales
~5x across an Apple M1 Pro's 6 performance cores (compute-bound, not memory-
bound in that range).

Sdpa::codegen gains a sequence-length heuristic: an f32 SDPA whose K/V length
is below TRACT_FLASH_SDPA_MIN_SEQ_LEN lowers to the decomposed matmul+softmax
path instead of FlashSdpaOp. Default 0 keeps flash for every length (with head
parallelism it beat the decomposed path at every size measured, 128–4096);
raise it on hosts where short-sequence decompose wins.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

wheels host setup now provisions from rust-toolchain.toml instead of a
hardcoded dtolnay stable ref; the fmt check and the contributor rule move
off pinned 1.91.0 rustfmt to stable, matching what the toml selects locally.
crates.yml reads rust-version from the workspace manifest and feeds it to
the test matrix and the cuda-minimum-deploy job, so the MSRV lives in one
place. cross-platform and examples drop their pinned-toolchain override and
inherit stable from rust-toolchain.toml; MSRV stays covered by the crates
matrix.

…st-toolchain.toml

ci-system-setup.sh and native.sh forced RUSTUP_TOOLCHAIN=1.91.0 when unset,
which overrode the toml across full.yml, large_models and cross.sh. Dropping
the default lets them use stable; a caller-set RUSTUP_TOOLCHAIN is still honored.

The pinned SHA de0fac2e is actions/checkout v6.0.2, but the comments read
# v6; that tag has since moved to df4cb1c, so zizmor flagged the mismatch.
Label the exact release across the workflows this branch touches.

…kflows

Same stale # v6 comment on the de0fac2e pin in the workflows this branch
had not yet touched. Verified via the GitHub API that every other pinned
action's version comment still resolves to its SHA; only checkout had drifted.

Add six f16 element-wise activations on x86 AVX-512: sigmoid_f16, tanh_f16,
hardswish_f16, leaky_relu_f16, silu_f16, gelu_f16. Each kernel chunks the
input through a 64-byte-aligned f32 scratch (CHUNK=256), dispatches to the
matching f32 AVX-512 kernel (the avx512_sigmoid_f32 / avx512_tanh_f32
wrappers, or the act:: hardswish / leaky_relu / silu / gelu kernels), and
converts back to f16. silu and gelu compose sigmoid_f32 / tanh_f32 with the
final combine done in f32.

The f16 <-> f32 conversion is driven by vcvtph2ps / vcvtps2ph via std::arch
intrinsics (cvt_f16_to_f32 / cvt_f32_to_f16 helpers); rustc + LLVM do not
autovectorize the scalar f16::to_f32 / f16::from_f32 loops, which is why a
naive port leaves AVX-512 stuck at ~7 Melem/s.

Wires into Ops::{sigmoid,tanh,hardswish,leaky_relu,silu,gelu}_f16 from
plug_avx512f; non-AVX512 x86 keeps the generic scalar f16 kernels. Validated
against the generic H<Op>8 reference via the existing *_frame_tests! macros
at SuperApproximate tolerance, which covers the precision delta between
scalar f16 arithmetic and f32-internal computation.

Measured on Cascade Lake (single-thread, throughput Gelem/s):
  - sigmoid_f16:    0.016 -> 1.54   (96x)
  - tanh_f16:       0.018 -> 1.61   (92x)
  - hardswish_f16:  0.051 -> 9.46   (186x)
  - leaky_relu_f16: 0.96  -> 10.4   (11x; generic baseline is unexpectedly fast)
  - silu_f16:       0.20  -> 0.93   (4.6x)
  - gelu_f16:       0.11  -> 0.75   (6.7x)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds sme_qmmm_i32_32x32, a 32x32 int8->i32 matmul kernel using SME2 SMOPA
(i8 outer-product), selected for qmmm_i32 when FEAT_SME2 is present. Consumes
the same K=4-inner PackedI8K4 packing as the SDOT kernel and implements the
int8 quant fuse ops (q_scale / rounding-shift / shift-left) bit-exactly
(spill-ZA->scratch->reload); only LeakyRelu is unsupported.

Builds on the SDOT kernel (sonos#2278) and dispatch fix (sonos#2277): needs PackedI8K4
plus the matmul/conv lowering from sonos#2278.

sme_qmmm 114/114 on M4 (SME2, SVL=512), bit-exact vs the NEON kernels.
core 244/244, linalg 3931/3931. Assembles + gates off cleanly on non-SME2
arm64 (kernel present, runtime-gated; M1 build + regression green).

Apple M4 e2e, single MatMulInteger, vs the SDOT kernel (sonos#2278):
1024^3 4.67->0.95 ms (4.9x), 512^3 0.70->0.17 ms (4.0x),
128x768x3072 1.80->1.17 ms (1.54x), 32x2048x2048 1.33->0.90 ms (1.47x).
Wash on small/overhead-bound matmuls (MiniLM/InceptionV1 seq=128); the win
is compute-bound int8 GEMM (large batch/hidden, LLM prompt).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add zmm (16-wide) implementations of softmax2-fastcompact and max-reduce,
overriding the FMA versions when avx512f is present; non-AVX512 x86 unchanged.

Measured on Cascade Lake (single-thread): +16% on max-reduce and +54% on
exp+sum vs the existing FMA assembly paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add x86_64_avx512_erf_f32_64n: AVX-512 (zmm, 16-wide) erf kernel mirroring
generic/erf.rs::serf (Abramowitz & Stegun 7.1.26 six-coefficient approximation),
processing 64 lanes per iteration via 4 zmm registers and FMA Horner chains.
Wires into Ops::erf_f32 from plug_avx512f; non-AVX512 x86 keeps the generic
scalar path.
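
For orientation, a scalar sketch of the Abramowitz & Stegun 7.1.26 formula with a Horner-style FMA chain is given below. It is not a copy of generic/erf.rs::serf: the function name is hypothetical, and the constants are the published A&S values (five polynomial coefficients plus the constant p), which this sketch assumes are the six coefficients the commit refers to.

```rust
/// erf(x) ~= sign(x) * (1 - (a1 t + a2 t^2 + a3 t^3 + a4 t^4 + a5 t^5) * exp(-x^2)),
/// with t = 1 / (1 + p |x|)  (Abramowitz & Stegun 7.1.26).
pub fn erf_approx(x: f32) -> f32 {
    const P: f32 = 0.327_591_1;
    const A: [f32; 5] = [
        0.254_829_592,
        -0.284_496_736,
        1.421_413_741,
        -1.453_152_027,
        1.061_405_429,
    ];
    let ax = x.abs();
    let t = 1.0 / (1.0 + P * ax);
    // Horner chain from a5 down to a1, one fused multiply-add per coefficient,
    // followed by a final multiply by t.
    let poly = A.iter().rev().fold(0.0f32, |acc, &a| acc.mul_add(t, a)) * t;
    let y = 1.0 - poly * (-ax * ax).exp();
    // erf is odd: restore the sign of the input.
    y.copysign(x)
}
```

A vector kernel would run the same chain on 16 lanes per zmm register; the scalar form is only meant to show the arithmetic.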

Also introduces linalg/src/frame/erf.rs with the erf_frame_tests! macro
(mirrors frame/hardswish.rs structure) so both the generic and AVX-512
implementations share a single proptest reference.

Measured on Cascade Lake (single-thread): ~4x over the autovectorized generic
scalar on hosts with FMA; the gap is significantly larger on pre-FMA x86
where the compiler can't autovectorize the polynomial.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add x86_64_avx512_softmax2_fastcompact_f16_64n: AVX-512 (zmm, 16-wide) f16
softmax_l2 mirroring the f32 fast-compact-exp algorithm. Each iteration
processes 64 f16 (128 bytes) through 4x (vcvtph2ps load, zmm f32 compute,
vcvtps2ph store). The sum is accumulated in f32 across the loop (higher
precision than the generic HSoftMaxL2 which accumulates in f16) and cast to
f16 at return; the SuperApproximate test tolerance covers the precision delta.

Wires into Ops::softmax2_fastcompact_f16 from plug_avx512f; non-AVX512 x86
keeps the generic scalar f16 path.

Measured on Cascade Lake (single-thread): ~112x over the generic scalar f16
path (75 Melem/s -> 8.4 Gelem/s). The generic baseline has per-element f16
arithmetic plus a scalar fast_compact_exp_f32 call; the AVX-512 path does the
exp 16-wide in f32 and pays only the IO-boundary conversion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pad had no GPU implementation, so any model using it (e.g. the Nemotron/Parakeet
preprocessor's signal-centering and real-to-complex pads) bounced through the host
mid-graph. Add GpuPad, a backend-agnostic copy-based op: broadcast the pad value
across the output, then copy the input into the interior, both via the existing
copy_nd. Constant mode only; Reflect/Edge stay on the host. The preprocessor signal
path now runs entirely on GPU.

Add a linalg-side fused row-wise RmsNorm primitive
(`tract_linalg::ops().rms_norm_f32`) that replaces tract-core's 4-call
composition (`MeanOfSquares` + `Add` + `Rsqrt` + `Mul`) with a single
two-pass kernel: sum-of-squares via 4 zmm FMA accumulators, scalar reduce
+ rsqrt, then multiply-back via 4 zmm broadcast-multiplies. Scalar tail
handles the remainder when row_len % 64 != 0; vmovups is used throughout
since per-row slices from a tensor are not guaranteed 64-byte aligned.
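
The two passes can be written out as a scalar reference. This is a sketch of the math only: `rms_norm_row` and its explicit `eps` parameter are assumptions made for illustration, not the signature of `tract_linalg::ops().rms_norm_f32`.

```rust
/// In-place row-wise RmsNorm: x_i <- x_i / sqrt(mean(x^2) + eps).
pub fn rms_norm_row(row: &mut [f32], eps: f32) {
    if row.is_empty() {
        return;
    }
    // Pass 1: sum of squares (the vector kernel spreads this over 4 accumulators).
    let sum_sq: f32 = row.iter().map(|x| x * x).sum();
    // Scalar reduce + reciprocal square root.
    let inv_rms = 1.0 / (sum_sq / row.len() as f32 + eps).sqrt();
    // Pass 2: multiply every element by the broadcast scale.
    for x in row.iter_mut() {
        *x *= inv_rms;
    }
}
```

After the call with `eps = 0`, the mean of squares of the row is 1 up to rounding, which is a convenient sanity check for any vectorized variant.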

`core::ops::nn::RmsNorm::eval` gains a fast path for F32 / F16 inputs
where the normalised axis is the last (contiguous) one — it iterates row
by row and dispatches to the linalg primitive. Other shapes (non-trailing
axis) keep the original composition. Generic scalar fallback ships
alongside the AVX-512 kernel; non-x86 and non-AVX-512 x86 keep the scalar
version, which is itself ~equivalent to the composed path because both
are memory-bandwidth bound.

CUDA and Metal already expose a fused `rms_norm` kernel
(`cuda/src/kernels/nn/rms_norm.rs`, `metal/src/kernels/nn/rms_norm.rs`);
this closes the CPU side of the same gap.

Measured on Cascade Lake (single-thread, kernel-level, throughput Gelem/s):
  - row 1024:  0.77 (composed) -> 12.4 (AVX-512)   16.2x
  - row 2048:  0.77            -> 13.8             17.9x
  - row 4096:  0.77            -> 13.8             17.9x

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a native f16 hardswish kernel using avx512fp16 ISA (Sapphire Rapids /
Granite Rapids / later Intel). 128 f16 lanes per iteration via 4 zmm of 32 f16
each, processed with vaddph / vminph / vmaxph / vmulph — no f32 round-trip,
no vcvtph2ps/vcvtps2ph at the IO boundary.

Wired through a new `plug_avx512fp16` step that runs after `plug_avx512f` on
hosts where `is_x86_feature_detected!("avx512fp16")` is true. The f32-roundtrip
hardswish_f16 kernel from `act_f16.rs` remains in place as the avx512f-only
fallback (Skylake-X, Cascade Lake, Ice Lake server prior to fp16 extension).

Bench on Sapphire Rapids (n=1024, single thread, Criterion):
  hardswish_f16:
    generic              52.3 Melem/s
    avx512_f32roundtrip   8.71 Gelem/s   (current #8 path)
    avx512fp16_native    31.6 Gelem/s   (this PR, 3.62× over the roundtrip)

A native leaky_relu_f16 kernel is also included but NOT wired — on Sapphire
Rapids it benched 38% slower than the f32-roundtrip version (5.85 vs 9.44
Gelem/s). The two-op-per-element compute path (vmulph + vmaxph) does not
saturate the FP16 execution port the same way the equivalent f32 ops saturate
the FP32 ports. Kernel is correct (4/4 frame tests pass, including proptest
against the f16 reference); kept in the source for future revisit on different
fp16 uarchs where the comparison might flip.

Tests: linalg 2845 passed, 0 failed (+4 new frame tests). Cross-arch
`cargo check` clean on aarch64-unknown-linux-gnu and wasm32-unknown-unknown
(plug_avx512fp16 is x86_64-only and feature-gated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@kali force-pushed the feat/avx512fp16-native-f16 branch from d40fd79 to 69172ed on June 8, 2026 12:07